Session 3 covered:
R markdown
Writing functions
Apply (again)
Installing packages
Using packages
Real life example: DESeq2
Session 3 covered:
R markdown
Writing functions
Apply (again)
Installing packages
Using packages
Real life example: DESeq2
# Functions multiply = function(x, y) return(x * y) multiply(3, 2) # Default arguments multiply = function(x, y=2) return(x * y) z = multiply(3) # Scopes multiply = function(x, y) z <<- x * y multiply(3, 2) # Passing through arguments apply(m, 1, multiply, y=2)
# Packages from CRAN
install.packages("ggplot2")
# Packges from Bioconductor
library(BiocManager)
install("DESeq2")
# Referencing package functions
DESeq2::plotPCA()
What did you learn?
Do we need to recap any parts?
Session 4
What makes a good figure?
Colour
viridis colour library
Plotting with ggplot2
Aesthetics
Geometries
Themes
Facets
Saving
Humans are visual creatures!
Visualisation of large data sets is an essential task in molecular biology and medicine. When done effectively, images can help you to explain the most complex of data.
Without an image or summarisation, how well would you do finding significant genes passing a given fold change threshold in a table like this?
## baseMean log2FoldChange lfcSE pvalue padj ## PENG0000000001 1.1941808 -0.11077611 0.46558089 1.822729e-01 1.000000e+00 ## PENG0000000002 0.1394510 -0.01310322 0.42899817 8.004145e-01 1.000000e+00 ## PENG0000000004 0.7393672 -0.01877427 0.39858386 7.883606e-01 1.000000e+00 ## PENG0000000006 44.3468742 0.08682659 0.19961999 4.041232e-01 6.352165e-01 ## PENG0000000007 788.9630161 1.65624481 0.20412453 2.062683e-18 1.994904e-16 ## PENG0000000009 1345.8691348 -0.15601837 0.13576046 9.532748e-02 2.327564e-01 ## PENG0000000010 4.4468600 0.04479898 0.29043812 6.594170e-01 1.000000e+00 ## PENG0000000012 671.1614707 -0.14329193 0.10518787 7.279070e-02 1.885944e-01 ## PENG0000000014 0.1394510 -0.01310322 0.42899817 8.004145e-01 1.000000e+00 ## PENG0000000015 770.7712490 0.70623622 0.15143560 1.487255e-07 2.841074e-06 ## PENG0000000016 5.3302466 0.14048684 0.33883106 2.315969e-01 1.000000e+00 ## PENG0000000022 1276.9182728 -0.06159326 0.07899988 3.184644e-01 5.180787e-01
There are a few key points you need to condider when deciding upon a method of visualisation:
relevance: the message a figure needs to convey
salience: how easily the eye can distringuish your message from the background
accuracy: how exactly different visualisation methods may convey your message
Especially when presenting (when the audience is listening and reading), it’s vital that salience and relevance are aligned.
Human vision is highly selective. We understand visual information by selecting, in turn, individual objects or aspects for detailed analysis rather than by appreciating an entire scene.
Formally, salience is the property of an object that sets it apart from its surroundings; it’s a relative properly, therefore, and depends on the collection of objects being visualised.
We can enhance salience by manipulating color, shape, size, and position to focus attention.
It’s not inherently different to make information salient …
… what’s important is that the salient points of a figure align with what’s relevant.
Nevertheless, it’s also easy to reduce salience by:
displaying too much information together
attempting to convey different points of relevance within the same image
referring to points of relevance that are tangential to what’s salient
When displaying a graphic, we want the viewer to be able to perceive the patterns and trends that convey the relevant point.
Humans are better able to interpret certain visual cues better than others, however - interpretation is subjective and everyone’s different! Let’s try and rank the following methods of conveying the same information:
Research in the field of visual theory has shown that, in order, people are best able to understand:
positions on a common aligned scale
positions on unaligned scales that are otherwise common
lengths
angles and slopes
area
volume and colour saturation
colour hue
Unfortunately, although it’s often a go-to method, colour is amongst the least reliable methods of conveying information!
This is worsened by the fact that colour perception is relative. We distinguish colours differently depending on their surroundings and can be easily tricked into seeing the same colour differently or into seeing different colours similarly.
Not only can be we be fooled into perceiving colours contextually, people differ in their ability to distinguish colours based on their genetics.
Colour blindness (colour vision deficiency or CVD) is a sliding scale and there are a number of different types. ‘Full’ phenotypes for the three more common forms of CVD are show here.
Across populations with Northern European ancestry, up to 1/12 males and 1/200 females have some level of red-green CVD. UK-wide, 4.5% of the population have some level of CVD and, even by the time they leave school, approximately 40% are unaware.
When producing figures that make use of colour, if we want them to be salient, it’s important to be inclusive! Thankfully, a number of colour scales have been developed that allow figures to retain salience for those with CVD.
The viridis library provides color maps for use in R that are:
colourful, spanning as wide a palette as possible
perceptually uniform, such that, across the whole range, nearby values have similar-appearing colors and distant values appear more distinct
friendly for those with colour vision deficiency
viridis can be installed from CRAN.
install.packages("viridis")
Although base R can produce a variety of plots, getting them to be ‘just so’ can be extremely difficult!
These days, virtually all ‘pretty’ plotting is done with the ggplot2 library.
install.packages("ggplot2")
Given its popularity, lots of libraries interface with ggplot2 (e.g. viridis). To complement ggplot2, we’ll also install patchwork, which helps to layout plots in rows and grids.
install.packages("patchwork")
library("ggplot2")
library("patchwork")
The central ggplot() function provides a consistent interface to map data to the aesthetics of geometries.
In other words:
the tabular data that we wish to plot …
… has various facets, which we map to the aesthetic properties we can perceive (position, colour, shape, size, and transparency) …
… of various geometric methods of displaying the data (bars, points, lines, etc)
Most commonly, we make a plot by passing the ggplot() function two arguments:
data, which takes a data frame (or something that can be coerced to one)
mapping, which uses the aes() function to define the aesthetic mappings we’d like to use
penguins = na.omit(read.csv("data/1_palmerpenguins.csv"))
penguins$species = factor(penguins$species)
penguins$island = factor(penguins$island)
penguins$sex = factor(penguins$sex)
penguins$year = factor(penguins$year, ordered=TRUE)
g = ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm)) class(g)
## [1] "gg" "ggplot"
Here, we’ve instantiated a ggplot object with the penguins data frame and defined two simple aesthetics - that the x axis should display bill_length_mm and that y should display bill_depth_mm.
Let’s have a look at what we’ve made.
g
Beautiful, we’re finished!
Remember that a ggplot requires data, aesthetics, and geometries. So far, we haven’t added any geometries to our plot. We can think of a ggplot as a painting; ggplot() provides the canvas and the geometries provide the layers of paint.
Geometries (geom_...() functions) added to a ggplot2 display data according to the aesthetics defined in the ggplot object. Geometries are added (literally, as we use the + operator) to the ggplot and we reassign the result to the original object variable name.
g = g + geom_point() g
Now we can see something!
What we have so far is a little simplistic! Let’s re-work the aesthetics by getting the ggplot object to colour according to the levels of the species factor (color and col are also valid if you’re not British).
g = ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) + geom_point() g
The aesthetic properties defined in the ggplot() function call are set as the defaults for all geometries, provided they’re applicable. All geometries added will be passed these defaults.
g + geom_smooth(method="lm")
Here, the linear model (method="lm") trend lines we add using geom_smooth() inherit their colour aesthetic from the ggplot parent object.
Let’s make our first ggplot!
Set up a ggplot object of displaying flipper_length_mm against body_mass_g
Add a colour aesthetic to the plot
Add a geom_smooth(method="lm") geometry to the plot
What happens if we exchange geom_point() for geom_density2d()?
The more facets of the data we map to aesthetics, the more it’s partitioned.
g = ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, colour=species, shape=sex)) g + geom_point() + geom_smooth(method="lm")
Here, the shape=sex aesthetic further subdivided the data by sex, giving 6 trend lines instead of 3.
We don’t always want our geometries to inherit all of the defaults passed to the ggplot() function, therefore! Thankfully:
individual geometries can have alternate aesthetics passed using the mapping= argument, which allows individual aesthetic parameters to be reset to NULL.
geom_point(mapping=aes(shape=NULL))
aesthetics can be manually specified for all data by passing the aesthetic parameter as an argument itself
geom_point(colour="black")
Here, we pass alpha=0.5 to geom_point() to manually set the alpha (transparency) of the points. Additionally, we override the shape aesthetic within geom_smooth() by setting it to NULL so that we don’t duplicate our trend lines.
g = g + geom_point(alpha=0.5) + geom_smooth(mapping=aes(shape=NULL), method="lm") g
Some aesthetics are only available for specific geometries and, depending on the geometry, certain aesthetic options are more relevant or salient than others.
As a general guide, discrete variables are best visually separated with …
shape (for points) for very small numbers of discrete groups and where overlapping is minimal
linetype (for lines) for very small numbers of discrete groups
colour (or fill for filled geometries) using separate hues
… whereas continuous variables are most accurately displayed using:
size or lineweight
a scaled gradient across distant colour hues
a scaled gradient across a single colour
alpha, if necessary
Separately to the aesthetics of the geometries, we can control other visual aspects of the plot by modifying the theme() defaults or by using one of the theme_...() presets. We can also use the labs() function to control the axis and plot titles.
g + theme_bw() + labs(x="Bill length (mm)", y="Bill depth (mm)", title="Graph with theme_bw()")
Let’s add some style to our ggplot.
Update your previous plot to use a theme preset. Try out theme_bw(), theme_classic(), and theme_minimal().
Even when using a theme preset, we can still override specific elements using theme(). What does adding theme(axis.text.y=element_text(angle=90, vjust=0.5, hjust=0.5)) achieve?
How might we rotate the labels for the x axis?
Looking at the help for element_text(), how might we change the size of the plot.title element?
There are many geometries available to help display data of different formats. The majority of graphing applications fall into five groups:
single continuous variable: geom_freqpoly(), geom_histogram(), geom_area(), geom_density()
single discrete variable: geom_bar()
two continuous variables: geom_point(), geom_smooth(), geom_rug(), geom_density_2d()
two variables, one continuous and one discrete: geom_boxplot(), geom_violin(), geom_dotplot(), geom_jitter(), geom_col()
two discrete variables: geom_count(), geom_jitter()
Let’s see a few options for plotting a continuous against a discrete variable.
Different geometries can be used to highlight - make salient - different aspects of the data. Here, geom_boxplot() better shows the position of the median value, whereas geom_violin() better highlights the spread of the data and geom_jitter() might do that too much!
Let’s see a few options for plotting two continuous variables against each other.
Here, as an alternative to geom_point(), geom_density2d() may better highlight the ‘centre of mass’ for each group. geom_rug() might be suitable alongside another geometry but otherwise lacks both accuracy and saliency.
Using this base …
ggplot(penguins, aes(x=body_mass_g, colour=species))
… let’s compare a couple of options for plotting a single continuous variable.
Add a geom_histogram() to the plot
Switch that for a geom_freqpoly()
Which plot has better accuracy and saliency?
Let’s apply some of our knowledge about using colour effectively … and inclusively!
The viridis library integrates easily with ggplot2 using various scale_colour_viridis_...() and scale_fill_viridis_...() functions:
scale_colour_viridis_d() makes salient mappings of discrete variables
scale_colour_viridis_c() makes accurately distinguishable mappings for continuous variables
scale_colour_viridis_b() merges the two - continuous variables are binned to enhance salience
Let’s start by replacing ggplot2’s default colour scheme for discrete variables using the scale_colour_viridis_d() function.
ggplot(penguins, aes(x=body_mass_g, y=flipper_length_mm, colour=species)) + geom_point() + theme_bw() + scale_colour_viridis_d()
Pretty easy!
As we saw earlier, viridis has a variety of palettes. Let’s compare a few on some discrete data by passing the option= argument to the scale_colour_viridis_d() function.
Using this base …
ggplot(penguins, aes(x=bill_length_mm, y=bill_depth_mm, colour=bill_length_mm/bill_depth_mm))
What are we doing in the colour aesthetic?
Make a graph making use of scale_colour_viridis_c()
Try a few different palettes by setting option=
Switch that out for scale_colour_viridis_b()
Within this nonsense example, which method has:
better accuracy?
better saliency?
Where we need to highlight two separate points within a plot, it’s common to incorporate the second factor using a separate aesthetic (e.g. as a symbol if we’re primarily using colour). However, mixing or increasing the complexity of aesthetics can lead to poor saliency!
g = ggplot(na.omit(penguins), aes(x=bill_length_mm, y=bill_depth_mm, colour=species)) + scale_colour_viridis_d(option="turbo") + theme_bw() g + geom_point(aes(shape=sex))
Often, a better alternative is to facet the data by the second factor.
facet_wrap() will use a factor (/factors) specified using formula notation to split the dataset into separate, grouped, plots.
g = g + geom_point() g + facet_wrap(~sex)
Alternatively, dor more complex layouts, we can use facet_grid() to produce a 2D grid.
g = g + facet_grid(rows=vars(sex), cols=vars(species)) g
For facet_grid(), it’s necessary to use ggplot2’s vars() function to extract the levels from the factors you’re faceting with.
Using this base …
ggplot(na.omit(penguins), aes(x=body_mass_g, colour=species))
geom_density() to display the distribution of body_mass_g or each speciesLet’s see if there’s variation in their weight by year.
facet_grid() that separates by year and sexThere are two ways of adjusting the axis scales in ggplot2:
setting everything outside of a specified window to NA. This is achieved by setting limits= within the relevant scale_...() function
scale_x_continuous(limits=c(1, 100))
zooming into a region without otherwise manipulating the data. This is achieved by setting xlim= and ylim= using the coord_cartesian() function
coord_cartesian(xlim=c(1, 100))
These differ in whether the non-visible points are used to contribute information to the plot:
limits prevents this. A geom_smooth() geometry will create a fit using only the points displayed
coord_cartesian() enables this. A geom_smooth() geometry will create a fit using all data, even if it’s not displayed.
Alongside the adjustment of the scales, it’s often useful to be able to add reference or threshold lines to a plot to enhance saliency.
ggplot2 has three geometries that allow this:
geom_hline(), taking a yintercept argument
geom_vline(), taking a xintercept argument
geom_abline(), taking slope and intercept arguments
g = ggplot(na.omit(penguins), aes(x=species, y=body_mass_g, colour=species)) + geom_violin(fill=NA) + geom_hline(yintercept=3000, linetype="dashed") + theme_bw() g
It’s also possible to add text annotations to a plot using annotate() elements, using the x and y arguments to position the text.
g = g + annotate("text", x=3.3, y=2900, label="Underweight")
g
For the purposes of display (and for positioning labels), categorical data, as above, are converted to sequences of integers (1, 2, 3, …) on the x axis.
It’s also possible to label specific points within a plot using geom_text() or geom_text_repel() from the ggrepel package, which repels labels from each other to prevent overlap.
Where there are only a few points, this works without further effort. Where there are lots, it’s more common to be needing to specifically label a specific subset. To do this, we supply new data to the geometry.
g + geom_text(aes(label=id), data=penguins[sample(nrow(penguins), 5), ])
Saving ggplot objects is simpler than for base R graphics.
ggplot2 has a special ggsave() function that will adjust width and height in a variety of units and save in 10 different formats!
ggsave( "outputs/4_ggsave.png", g, height=10, width=15, units="cm")
Here, we pass a specific ggplot object to save but, by default, it saves the last displayed plot.
The ggsave() function is also clever enough to guess the required device (png, svg, pdf, etc) from the file extension given in the path name; we can specify this for clarity but it’s not necessary.
There are more penguin-related homework tasks to help cement what we’ve covered today!
The homework and instructions can be found within the main directory for the course: ./homework/Homework_4.Rmd